Advanced and Mainstream Courses: Influence on Mental and Physical Expectations and Experiences

Assignment Project 1 Report

1 Executive Summary

This report investigates the advanced and mainstream courses DATA2x02 and their influence on the mental and physical expectations and experiences of students. While the sampled population is a stratum of all advanced and mainstream courses at USYD and their respective students, the following investigation aims to gain insight into effects within the DATA2x02 cohort.

The main discoveries are that DATA2902 students tend to find courses more difficult and hold higher salary expectations, while the majority of DATA2002 students' difficulty ranks lie at the extremes of easy and difficult. COVID influences, in terms of COVID tests/vaccination and location, showed no observed effect.

2 Introduction

2.1 Domain Knowledge

The data for this report was sourced through an online survey hosted on the Google Forms platform, with questions addressing student engagement under COVID conditions, assumptions/expectations, and behavioural and physical characteristics of survey respondents.

Data from the survey was collected from a sample of the population of University of Sydney students undertaking the unit DATA2002/DATA2902 in Semester 2 of 2021.

2.2 Data Collection

The known population comprises 696 students in DATA2002 and 58 students in DATA2902; the survey captures only a sample of the students undertaking the two courses.

Evidently there is a larger sample of DATA2002 students compared to DATA2902; had students of each subgroup been chosen by the surveyor, this would have introduced sampling error. As students chose to participate in the survey, this is a non-probability/volunteer sample and hence not random. For more information: Online Data Sampling

2.3 Limitations and Bias

Given the data collection methods, the sample may under-represent or over-represent the two course groups, so non-response bias and volunteer bias are likely present.

There are also confounding factors concerning location of residence and ethical/cultural influences on variables such as COVID test frequency and the likelihood of exercise under lockdown in a region. With variability in case numbers by location, difficulty undertaking a course due to mental and/or physical health may be observed in the analysis.

Improvements to the questions relating to height, salary, and major of degree would have been beneficial: response validation on units of measurement (e.g. cm), annual salary, and predefined major options would have simplified the data cleaning process. Questions that are known or suspected to be related should also have been made mandatory, e.g. stress and loneliness, place of residence and the number of COVID tests taken.

3 Initial Data Analysis (IDA)

3.1 Classification & Complexity of Data

Data was loaded using the Tidyverse library accompanied by other data cleaning libraries. The resulting dataset comprised a variety of categorical data and messy input caused by the lack of data validation during surveying, including confusion over measurement units, invalid input, and redundant values with the same meaning. As existing libraries/methods could not cover all cases, custom functions were used for variables such as height and salary. Missing values in the dataset indicate unanswered questions (as NA), and invalid/bad input was converted to NA as well. Categorical columns were also converted to factors when they followed a natural ordering.

3.2 Data Importing and Cleaning

Survey Data Loading

#Survey Data Loading

surveyData <- readr::read_csv("surveyData.csv", 
    na = c(""," ", "empty"),show_col_types = FALSE)

Dataset Cleaning - Columns and Factors

#Dataset Cleaning - Columns and Factors

clean_data<-janitor::clean_names(surveyData)

short_colnames = c("time","covid_tests","living_arrangements","height",
                "interpret_wednesday","in_aus","math_ability","r_ability",
                "data2002_rank","uni_year","webcam_freq","vac_status","fav_social_media",
                "gender","steak_pref","dominant_hand","stress_in_Wk",
                "lonely","no_emails","sign_off","salary","unit","major","exercise_wk")

colnames(clean_data)<-short_colnames

# Setting Appropriate Columns to Factors 
factor_cols = c("living_arrangements","in_aus","data2002_rank",
                "uni_year","webcam_freq","vac_status","steak_pref","dominant_hand",
                "unit")

clean_data[factor_cols] <- lapply(clean_data[factor_cols], factor)

# Fixing Order of Factor Levels
# Note: levels are re-ordered with factor(x, levels = ...);
# assigning via levels()<- would rename the existing (alphabetically
# ordered) levels and silently scramble the data.

clean_data$in_aus = factor(clean_data$in_aus, levels = c("Yes", "No"))

clean_data$data2002_rank = factor(clean_data$data2002_rank,
                                  levels = c("Easy","Standard","Difficult"))
clean_data$uni_year = factor(clean_data$uni_year,
                             levels = c("First year (not yet successfully completed 48CP)",
                                        "Second year (completed at least 48CP but less than 96CP)",
                                        "Third year or higher (successfully completed more than 96CP)"))
clean_data$webcam_freq = factor(clean_data$webcam_freq,
                                levels = c("None of the time", "Some of the time",
                                           "Most of the time", "All the time"))
clean_data$vac_status = factor(clean_data$vac_status,
                               levels = c("I do not want to get vaccinated","Not yet, but I want to",
                                          "I'm partially vaccinated", "I'm fully vaccinated"))
clean_data$steak_pref = factor(clean_data$steak_pref,
                               levels = c("I don't eat beef","Rare","Medium-rare",
                                          "Medium-well done","Medium","Well done"))
clean_data$dominant_hand = factor(clean_data$dominant_hand,
                                  levels = c("Left","Right","Ambidextrous"))

Data Cleaning - Individual Columns

# Data Cleaning - Individual Columns

## Gender

gender_copy = clean_data$gender
# recode_gender() (gendercoder package); fill = TRUE keeps values
# that don't match the dictionary
gender_copy = suppressMessages(recode_gender(gender = gender_copy, fill = TRUE))
#table(clean_data$gender,gender_copy)
clean_data$gender = tolower(gender_copy)
clean_data$gender[(clean_data$gender == "famale") | 
                    (clean_data$gender == "woman/female") ] = "female"

## Height

# Group the alternation so the surrounding \s* applies to every unit token
clean_data$height = clean_data$height %>% 
  str_replace("\\s*(feet|foot|ft)\\s*", "'") %>% 
  str_replace("\\s*(inches|in|''|\")\\s*", "")

feet_inches_to_cm <- function(height_str) {
  # Convert strings like "5'10" to centimetres; pass anything else through
  if (grepl("'", height_str, fixed = TRUE)) {
    feet_inch_split = as.numeric(str_split_fixed(height_str, "'", 2))
    in_cm = as.character(feet_inch_split[1] * 30.48 + feet_inch_split[2] * 2.54)
    return(in_cm)
  }
  return(height_str)
}

clean_data$height = suppressWarnings(sapply(clean_data$height,feet_inches_to_cm))
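As a quick sanity check on the conversion arithmetic (1 foot = 30.48 cm, 1 inch = 2.54 cm), a standalone sketch; `ft_in_to_cm` is an illustrative helper name, not part of the cleaning pipeline:

```r
# Standalone sketch of the feet/inches-to-cm arithmetic used above
ft_in_to_cm <- function(feet, inches = 0) {
  feet * 30.48 + inches * 2.54
}

ft_in_to_cm(5, 10)  # 5'10" -> 177.8 cm
ft_in_to_cm(6, 0)   # 6'0"  -> 182.88 cm
```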

clean_data = clean_data %>%  
  dplyr::mutate(

    height_clean = suppressWarnings(readr::parse_number(height,trim_ws = TRUE)),
    height_clean = case_when(
      height_clean <= 2.5 ~ height_clean * 100,
      height_clean <= 9 ~ NA_real_,
      TRUE ~ height_clean
    ) 
  )


##Interpret Wednesday 
all_days = tolower(weekdays(Sys.Date()+0:6))
clean_data = clean_data %>% mutate (
  interpret_wednesday = tolower(interpret_wednesday),
    interpret_wednesday = case_when(
      # "wednesday" (9 characters) is the longest day name,
      # so anything longer cannot be a valid day
      nchar(interpret_wednesday) > 9 ~ NA_character_,
      !(interpret_wednesday %in% all_days) ~ NA_character_,
      TRUE ~ interpret_wednesday
    )
)

##fav_social_media
clean_data = clean_data %>% mutate (
  fav_social_media = tolower(fav_social_media),
  # keep only the first word, e.g. "instagram reels" -> "instagram"
  fav_social_media = gsub(' [a-z ]*', '', fav_social_media),
    fav_social_media = case_when(
      # normalise common abbreviations
      (fav_social_media == "insta") | (fav_social_media == "ig") ~ "instagram",
      TRUE ~ fav_social_media
    )
)

## Salary

# Assuming an average of 38 hours worked per week - https://www.fairwork.gov.au/employee-entitlements/types-of-employees/casual-part-time-and-full-time/full-time-employees 


salary_conversion = function(salary) {
  if (is.na(salary)) {
    return(salary)
  }
  salary = tolower(salary)
  
  if (grepl( "hr|hour", salary)) {
    salary = as.character(parse_number(salary) *38*52)
    
  }else if (grepl( "wk|week", salary)) {
    salary = as.character(parse_number(salary) *52)
    
  }else if (grepl( "month|mth", salary)) {
    salary = as.character(parse_number(salary) *12)
    
  }else if (endsWith(salary,"k")) {
    salary = as.character(parse_number(salary) *1000)
  }
  salary = gsub(" ", "", salary, fixed = TRUE)
  return(salary)
}

clean_data$salary = suppressWarnings(sapply(clean_data$salary,salary_conversion))

clean_data = clean_data %>%  
  dplyr::mutate(
    salary = suppressWarnings(readr::parse_number(salary,trim_ws = TRUE)))
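To illustrate the annualisation rules above (38 hours/week, 52 weeks/year), a standalone sketch; `annualise` is an illustrative helper, not part of the pipeline:

```r
# Standalone sketch of the salary annualisation rules above,
# assuming 38 worked hours per week and 52 weeks per year.
annualise <- function(amount, period = c("year", "hour", "week", "month")) {
  period <- match.arg(period)
  switch(period,
         hour  = amount * 38 * 52,  # e.g. "$25/hr" -> 25 * 1976
         week  = amount * 52,
         month = amount * 12,
         year  = amount)
}

annualise(25, "hour")    # 49400
annualise(1200, "week")  # 62400
```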

4 Survey Data Insights

Plotting the COVID-related variables by their components, we can observe intriguing insights into how the patterns may or may not be associated. The main division is by the residence of the survey takers, to observe whether the COVID case surges in NSW over the past 2 months influence vaccination status. The anti-vaccination group (in Australia) appears to have a greater spread in exercise, which may also translate to a lower need to get tested, evident through the lack of overlap and the size of points; the odd responses can be treated as outliers among the rest. Those outside Australia appear more willing, with contrasting exercise responses and high COVID test frequency. Note that confounders such as lockdown, immigration, and government requirements may be involved. (OECD, 19 October 2020)

# COVID Insights plot
insight1 = clean_data %>%  na.omit() %>% ggplot() + aes(x=in_aus,y=exercise_wk,size = covid_tests, color = vac_status) + geom_jitter() + coord_flip()  + labs(title = "COVID Insights by Location", caption = "(Note: Confounders can be lockdown, immigration, government requirements)",
        x = " In Australia ?", y = ("Exercise per week (hr/s) "), size = "COVID tests (in 2 mths)" , color = "Vaccination Status" )

insight1

5 Testing and Results

5.1 Does the number of COVID tests a student has taken in the past two months follow a Poisson distribution?

Motivation - The COVID tests variable models the count of tests over a fixed interval of 2 months; the counts must therefore be integers \(\ge\) 0 and the tests are assumed independent.

Considering the survey was conducted during the surge of COVID cases in NSW and neighbouring states, the frequency of tests taken presents interesting prospects.

Assumptions

  1. The expected frequencies, \(e_i\) = \(np_i\) \(\ge\) 5
  2. Counts are independent and counts \(\ge\) 0

Since the number of counts is \(\ge\) 0 for all cell values and their frequencies, assumption 2 is reasonable.

5.1.1 Poisson Goodness of Fit Test

5.1.1.1 Expected vs Observed Covid Tests Plot

Now checking assumption 1, we find that expected frequencies of some cells are < 5 (points below the line).

# Calculations for Poisson Test
lambda = mean(clean_data$covid_tests,na.rm = TRUE)

observed_covid_tests = as.vector(table(clean_data$covid_tests))
n = sum(observed_covid_tests)


hyp_probs = c(dpois(0:9, lambda), ppois(9, lambda, lower.tail = FALSE))
expected_covid_tests = n * hyp_probs

covid_tests_df = tibble(covid = 0:10, hyp_probs, expected_covid_tests, observed_covid_tests = c(126,  40,  16 ,  4, 5,  9,  1 ,1 ,4 ,0,2))


# Plotting Calculations for Poisson Test
p1 = covid_tests_df %>%
    ggplot() + aes(x = covid) + geom_col(aes(y = observed_covid_tests, alpha = 0.5)) + theme_bw() +
    geom_point(aes(y = expected_covid_tests), color = ifelse(expected_covid_tests<5, 'red', 'green')) + scale_colour_manual(labels = c("<5", ">5"), values=c('green', 'red')) + geom_hline(yintercept = 5)   + labs(title = "Expected Frequency >= 5",
        x = " Covid Tests", y = ("Observed Covid Tests "))          
plotly::ggplotly(p1)

5.1.1.2 After Amalgamating

Amalgamating the categories with small expected cell counts, the new expected values satisfy assumption 1.

# Amalgamate the categories with small expected cell counts:

hyp_probs2 = c(hyp_probs[1:3],sum(hyp_probs[4:11]))
observed_covid_tests[4] = sum(observed_covid_tests[4:length(observed_covid_tests)])
observed_covid_tests = observed_covid_tests[1:4]

expected_covid_tests2 = n * hyp_probs2

t0 = sum ((observed_covid_tests - expected_covid_tests2)^2/expected_covid_tests2) 

covid_tests =  c("0","1","2","3+")
al = tibble(covid_tests,observed_covid_tests,expected_covid_tests2 )
colnames(al) = c("Covid Tests","Observed Covid Tests", "Expected Covid Tests")
gt(al)
Covid Tests   Observed Covid Tests   Expected Covid Tests
0             126                    74.34318
1             40                     76.48769
2             16                     39.34703
3+            26                     17.82209

Hypothesis Framework

1: Hypothesis: \(H_0:\) the data come from a Poisson distribution vs \(H_1:\) the data do not come from a Poisson distribution.

2: Assumption: The expected frequencies, \(e_i\) = \(np_i\) \(\ge\) 5 and Observations are independent

3: Test statistic: \(T=\sum_{i=1}^k \frac{(Y_i-n p_i)^2}{n p_i}\) Under \(H_0, T \sim \chi_{2}^2\)

4: Observed test statistic: \(t_0=\) 70.91

5: p-value: \(P(T \ge t_0) = P(\chi_{2}^2 \ge 70.91) = 4.44 \times 10^{-16}\).

6: Decision: Since the p-value is less than 0.05, we reject the null hypothesis: the COVID tests data are not consistent with a Poisson distribution.
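The decision above can be reproduced directly from the amalgamated table: with 4 cells, one total constraint, and one estimated parameter (\(\lambda\)), the degrees of freedom are \(4 - 1 - 1 = 2\).

```r
# Recomputing the goodness-of-fit statistic from the amalgamated table above
observed <- c(126, 40, 16, 26)
expected <- c(74.34318, 76.48769, 39.34703, 17.82209)

t0 <- sum((observed - expected)^2 / expected)
# df = 4 cells - 1 (totals fixed) - 1 (lambda estimated) = 2
p_value <- pchisq(t0, df = 2, lower.tail = FALSE)

round(t0, 2)  # ~70.91; p_value is below machine precision
```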

5.2 Question 2 - Is there any association between course and the experienced difficulty?

Motivation - It is a common conception that advanced courses are expected to be greater in difficulty, so a test of independence is viable for identifying whether the course and its perceived difficulty are associated.

Also, as the survey was given to the population of students doing DATA2X02, the resulting subgroups are from the same population, satisfying the requirements for an independence test.

5.2.1 Setup

5.2.1.1 Contingency Table for Course Vs Rank

The 2x3 contingency table for Course vs Rank gives the first impression that the DATA2002 course ranks are reasonably balanced between the two extremes, while DATA2902 is skewed with proportionally more students finding it difficult.

# 2x3 Contingency Table for Course Vs Rank
unit_by_diff = table(clean_data$unit,clean_data$data2002_rank)
unit_by_diff
##                      
##                       Easy Standard Difficult
##   DATA2002              57       10       107
##   DATA2902 (Advanced)    3        6        26

5.2.1.2 Plots

We can see that the proportions of the course ranks in the mosaic plot are consistent with this impression.

mosaicplot(unit_by_diff)

Assumptions

  1. The expected frequencies, \(e_i\) = \(np_i\) \(\ge\) 5
  2. Independent observations and were randomly sampled

From previous sections, it is known that the sample was not random: participants self-selected from the DATA2x02 student population. Furthermore, calculating the expected cell counts shows assumption 1 is also not satisfied, as there are value(s) < 5, so using the chi-squared distribution to compare the test statistic may not be valid.

suppressWarnings(chisq.test(unit_by_diff,correct = FALSE)$expected)  %>%
    round(1)
##                      
##                       Easy Standard Difficult
##   DATA2002              50     13.3     110.7
##   DATA2902 (Advanced)   10      2.7      22.3
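These expected counts come from the usual independence formula \(e_{ij} = y_{i\bullet} y_{\bullet j} / n\); recomputing them from the contingency table:

```r
# Expected counts under independence: e_ij = (row total * column total) / n,
# using the observed contingency table from above.
tab <- matrix(c(57, 10, 107,
                3,  6,  26), nrow = 2, byrow = TRUE,
              dimnames = list(c("DATA2002", "DATA2902"),
                              c("Easy", "Standard", "Difficult")))
n <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / n
round(expected, 1)
# DATA2902 x Standard: 35 * 16 / 209 = 2.7 < 5, violating assumption 1
```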

Alternatively, we use a permutation test with Monte Carlo simulation instead of Fisher's exact test, since Fisher's test assumes the row and column totals are fixed and our sample size is large. The permutation approach is viable as it makes no assumptions about the underlying distribution of the population.

set.seed(134)

# Monte Carlo Test by Simulation
p_test = chisq.test(unit_by_diff, simulate.p.value = TRUE, B = 10000)

paste("The resulting test statistic being",round(p_test$statistic,2), "and Monte Carlo p-value",round(p_test$p.value,5) )
## [1] "The resulting test statistic being 11.63 and Monte Carlo p-value 0.003"
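A minimal sketch of what `simulate.p.value = TRUE` does internally, using base R's `r2dtable()` to draw random tables with the same margins (variable names here are illustrative):

```r
set.seed(134)

# Pearson chi-squared statistic for a 2-way table
chisq_stat <- function(tab) {
  e <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - e)^2 / e)
}

tab <- matrix(c(57, 10, 107, 3, 6, 26), nrow = 2, byrow = TRUE)
t0  <- chisq_stat(tab)  # ~11.63, as reported above

# Generate 10000 random tables with the same row/column totals
sims  <- r2dtable(10000, rowSums(tab), colSums(tab))
stats <- vapply(sims, chisq_stat, numeric(1))

mean(stats >= t0)  # Monte Carlo p-value, close to the 0.003 reported above
```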

Hypothesis Framework

1: Hypothesis: \(H_0:\) course group independent of difficulty rank vs \(H_1:\) course group and difficulty rank are dependent.

2: Assumption: Under Monte Carlo simulation, no assumptions are made about the underlying distribution.

3: Test statistic: \(t_0=\sum_{i=1}^r\sum_{j=1}^c\frac{(y_{ij}-y_{i\bullet}y_{\bullet j}/n)^2}{y_{i\bullet}y_{\bullet j}/n}\) Note: degrees of freedom are not applicable (reported as NA) under Monte Carlo simulation

4: Observed test statistic: \(t_0=\) 11.63

5: p-value: The Monte Carlo p-value is 0.003, with 10,000 iterations.

6: Decision: Since the p-value is less than 0.05, the data provide evidence against \(H_0\). There is evidence to suggest that there is an association between course unit and difficulty rank.

# For Reference purposes
fisher_pval = fisher.test(unit_by_diff)$p.value

t = as.data.frame(unit_by_diff)
colnames(t) = c("Unit","Difficulty_Rank", "Freq")
unit_p = ggplot(t, aes(fill=Difficulty_Rank, y=Freq, x=Unit)) +
    geom_bar(position="stack", stat="identity") + coord_flip() + theme_light()
plotly::ggplotly(unit_p)

Given the difference in observed proportions of difficulty ranks among the courses above, our conclusion is further supported by the p-value of Fisher's test, 0.002, which is also below the 0.05 significance level.

5.3 Question 3 - Is the height of DATA2x02 students affected by the difficulty of courses and the resulting stress?

Motivation - The previous test found sufficient evidence of an association between course and difficulty rank, and several studies and conceptions suggest stress can stunt height, especially in the developing years of adolescence (Erfanti, Setiabudi & Rusmil, 2016).

Assuming every participant in the survey sample undertakes either the advanced or mainstream stream of the course, does the sample data for the two independent populations of DATA2X02 reflect the patterns around stunted height described by these studies/conceptions?

The graphs below suggest the majority of stress responses lie not at the extremes of 1 and 10 but rather in the range 3 to 8, with the counts for both DATA2002 and DATA2902 roughly left-skewed, indicating that the majority of stress values are just above 5. We now conduct a two-sample t-test to observe whether this influences height.

da2 = clean_data %>%   na.omit() %>%  ggplot() + aes(stress_in_Wk) +
  geom_histogram(binwidth = 1,fill= c("light blue")) + facet_wrap(~unit) + 
  labs(title = "Stress Responses among Courses",
       caption = "(Note the stress responses were for 
                  a week of the course duration)",
        x = "Stress (1-10)", y = ("Frequency") ) + theme_bw()

# Majority of Stress responses above 5 - DATA2902
res1 = clean_data %>% filter(unit == "DATA2002", stress_in_Wk > 5) %>% nrow()  > clean_data %>% filter(unit == "DATA2002", stress_in_Wk < 5) %>% nrow()
# Majority of Stress responses above 5 - DATA2902
res2 = clean_data %>% filter(unit == "DATA2902 (Advanced)", stress_in_Wk > 5) %>% nrow()  > clean_data %>% filter(unit == "DATA2902 (Advanced)", stress_in_Wk < 5) %>% nrow()

# plotly::ggplotly(da2)
da2

Assumptions for the Two Sample T-test

  1. The two samples are independent, and observations within each sample are independent and identically distributed
  2. The samples under investigation follow a normal distribution \(N(\mu_X,\sigma^2)\)
  3. The two underlying normal populations have the same variance.

Given that the surveyor was independent during data collection, and the chance of students from either course undertaking the survey was the same regardless of when they took it (by definition of independence), assumption 1 is satisfied.

Furthermore, under the Central Limit Theorem, a sample size > 30 results in a sampling distribution that approximates a normal distribution. Since we have 174 DATA2002 and 36 DATA2902 students, assumption 2 is satisfied.

Conducting an F-test (below), the difference between the two variances was found to be negligible: the resulting p-value was greater than the 0.05 significance level, so there is no significant difference between the variances and assumption 3 also stands.

data20df =  clean_data %>% filter(unit == "DATA2002") %>% select(height_clean)
data29df =  clean_data %>% filter(unit == "DATA2902 (Advanced)") %>% select(height_clean)
sd29 = round(sd(data29df$height_clean, na.rm = TRUE),2)
sd20 =  round(sd(data20df$height_clean, na.rm = TRUE),2)

# F-TEST 
# H_0 : the two variances are equal
# H_1 : the two variances differ
var_test = var.test(data20df$height_clean, data29df$height_clean)$p.value 
paste("Standard deviations of both samples", sd20, ", ", sd29 )
## [1] "Standard deviations of both samples 8.8 ,  7.78"
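The F-test is simply the ratio of the two sample variances referred to an F distribution; a sketch on synthetic data (illustrative only, not the survey heights) showing the manual ratio matches `var.test()`:

```r
# var.test() compares var(x)/var(y) against an F(n_x - 1, n_y - 1) distribution
set.seed(1)
x <- rnorm(50, sd = 2)  # synthetic stand-in samples
y <- rnorm(40, sd = 2)

f_manual <- var(x) / var(y)
f_test   <- var.test(x, y)

all.equal(unname(f_test$statistic), f_manual)  # TRUE
```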

Hypothesis Framework

Let \(X_i\) be the height of the \(i^{th}\) DATA2002 student and \(Y_j\) the height of the \(j^{th}\) DATA2902 student. Let \(\mu_X\) and \(\mu_Y\) be the population mean heights for DATA2002 and DATA2902 students respectively.

1: Hypothesis: \(H_0\colon\ \mu_{X}=\mu_{Y}\) vs \(H_1\colon\ \mu_{X} \ne \mu_{Y}\)

2: Assumptions: \(X_1,...,X_{n_x}\) are iid \(N(\mu_X,\sigma^2)\), \(Y_1,...,Y_{n_y}\) are iid \(N(\mu_Y,\sigma^2)\) and the \(X_i\)’s are independent of the \(Y_j\)’s

3: Test statistic: \(T = \dfrac{{\bar X} - {\bar Y}}{S_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}\) where \(S^2_p = \dfrac{(n_x-1) S_{x}^2 + (n_y-1) S_{y}^2}{n_x+n_y-2}\). Under \(H_0\), \(T \sim t_{n_x+n_y-2}\)
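On synthetic data (illustrative only, not the survey heights), the pooled statistic in step 3 reproduces `t.test(..., var.equal = TRUE)`:

```r
set.seed(2)
x <- rnorm(30, mean = 170, sd = 8)  # synthetic stand-in for one group
y <- rnorm(20, mean = 172, sd = 8)  # synthetic stand-in for the other

nx <- length(x); ny <- length(y)
# Pooled variance: S_p^2 = ((n_x - 1) S_x^2 + (n_y - 1) S_y^2) / (n_x + n_y - 2)
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

t_builtin <- t.test(x, y, var.equal = TRUE)$statistic
all.equal(unname(t_builtin), t_manual)  # TRUE
```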

4: Observed test statistic: \(t_0=\)

# Two Sample T-test Calculations
 t_test = t.test(data29df$height_clean,data20df$height_clean, alternative = "two.sided", var.equal = TRUE)

paste('The resulting test statistic being, ', round(t_test$statistic,3) )
## [1] "The resulting test statistic being,  1.172"

5: p-value: \(2P(t_{199} \ge |\) 1.172 |) = 0.242

6: Decision: The data are consistent with \(H_0\). There does not appear to be evidence that heights are different for students undertaking DATA2902.

In summary, there appears to be insufficient evidence to retain the conception that stress influences the height of students undertaking difficult courses.

However, it is important to mention that the alternative hypothesis was built under the limitation that the stress responses covered only one week of a specific course in the students' degree, along with the assumption that stress may stunt height.

5.4 Question 4 - Does the course taken influence students’ expectations of salary?

Motivation - Given that insufficient evidence was found for any physical implications of studying harder/advanced courses that lead to increased stress, we shift our attention to the salary expectations of the two subgroups of DATA2X02.

5.4.1 Deciding on the Test

5.4.1.1 Finding Suitable Assumptions - T-test

Similar to the previous investigation, an equal-variance assumption was tested and found unsatisfied, as the p-value was below the 0.05 significance level. This makes the two-sample t-test an unsuitable method for our purpose.

data20df1 =  clean_data %>% filter(unit == "DATA2002") %>% select(salary)
data29df2 =  clean_data %>% filter(unit == "DATA2902 (Advanced)") %>% select(salary)

var_test1 = var.test(data20df1$salary, data29df2$salary)$p.value 

#Cannot Use Two Sample T-test 
paste('The resulting F-test for variance giving a p-value: ', round(var_test1,3))
## [1] "The resulting F-test for variance giving a p-value:  0.001"

5.4.1.2 Finding Suitable Assumptions - Welch

Alternatively, with a Welch two-sample t-test, equal population variances are no longer a concern. However, a boxplot shows that normality within the salary data is doubtful, with evidence of skewed distributions containing large but contextually reasonable salary responses.

data20df1 =  clean_data %>% na.omit() %>% filter(unit == "DATA2002") %>% select(salary)
data29df2 =  clean_data %>% na.omit() %>% filter(unit == "DATA2902 (Advanced)") %>% select(salary)

data20df1$e <- 'Data2002'
data29df2$e <- 'Data2902'

# combine the two data frames etapa1 and etapa2
combo <- rbind(data20df1, data29df2)

plot1 = ggplot(combo) + aes(x = "", y = salary) +
  geom_boxplot(fill = "light blue") + facet_wrap(~e) + labs( title = "Salary Expectations" , y= " ", x= " ") + scale_y_continuous(labels = scales::comma) + theme_bw()

plotly::ggplotly(plot1) %>% layout(plot_bgcolor='#e5ecf6',
         yaxis = list(title=list(text='Annual Salary', font = list(size = 15), standoff = 18)))

Instead we use the Wilcoxon rank-sum test to further the investigation, as the normality assumption is relaxed in this case. Hence, we compare the locations of the two independent salary samples of DATA2x02.

Assumptions for the Wilcoxon rank-sum test

  1. The two samples are independent
  2. The two samples follow distributions of a similar shape (differing only by a shift)

Similar to the previous test, the chance of students from either course undertaking the survey was the same regardless of when they took it (by definition of independence). Therefore assumption 1 is satisfied.

Looking at the distribution of salary expectations between the two groups, a clear majority of DATA2902 students expect a larger salary in comparison to the DATA2002 cohort.

Furthermore, the two samples may not be normal, but both are similarly skewed, with inconsistencies arising from outliers in the data that are filtered out before the test. Therefore assumption 2 stands as well.

# Note: vectorised & (not &&) so the range filter applies row-wise
data20df1 =  clean_data %>% na.omit() %>% filter(unit == "DATA2002", salary > 10000 & salary < 500000) %>% select(salary)
data29df2 =  clean_data %>% na.omit() %>% filter(unit == "DATA2902 (Advanced)", salary > 10000 & salary < 500000) %>% select(salary)

data20df1$e <- 'Data2002'
data29df2$e <- 'Data2902'

# combine the two data frames etapa1 and etapa2
combo <- rbind(data20df1, data29df2)

p3 = ggplot(combo, aes(salary, fill = e)) + geom_density(alpha = 0.2) +labs( title = "Distribution Densities of Salary Expectations" , y= " ", x= "Annual Salary", fill = "Courses")   + scale_x_continuous(name="Annual Salary", labels = scales::comma) +scale_y_continuous( labels = scales::comma) + theme_bw()
plotly::ggplotly(p3) %>% layout(
         yaxis = list(title=list(text='Density', font = list(size = 15), standoff = 18)))

Hypothesis Framework

Let \(A_i\) be the salaries for the DATA2002 student cohort and \(B_j\) the salaries for the DATA2902 student cohort. Let \(\mu_A\) and \(\mu_B\) be the population means of the salary expectations for DATA2002 and DATA2902 students respectively.

1: Hypothesis: \(H_0\colon\ \mu_{A}=\mu_{B}\) vs \(H_1\colon\ \mu_{A} \ne \mu_{B}\)

2: Assumptions: \(A_i\) and \(B_i\) are independent and follow the same kind of distribution but differ by a shift.

3: Test statistic: \(W = R_1 + R_2 + \ldots + R_{n_A}\) Under \(H_0\), W follows the WRS( 113 , 29 ) distribution.

# Wilcoxon Test Calculations 
w_A = data20df1 %>% 
  mutate(rank = rank(salary)) %>%  pull(rank) %>%  sum()
wil_test = wilcox.test(data20df1$salary,data29df2$salary, correct = FALSE)

4: Observed test statistic: \(w=\) 6441

5: p-value: \(2 P(W \ge w) =\) 0.017 because \(w =\) 6441 \(> \frac{n_A(N+1)}{2} =\) 6399.5 so we look in the upper tail.

6: Decision: As the p-value is less than 0.05, the data are inconsistent with \(H_0\): there is sufficient evidence at the 0.05 significance level that the salary expectations of the two populations differ. However, the differing sample sizes and the Wilcoxon rank-sum test’s sensitivity to differences in the shape of the data may have reduced the power of the test.
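For reference, R’s `wilcox.test()` reports the Mann-Whitney form of the statistic, \(U = W - n_A(n_A+1)/2\), where \(W\) is the rank sum in step 3; a sketch on synthetic data (illustrative only, not the survey salaries):

```r
set.seed(3)
a <- rnorm(15, mean = 60000, sd = 10000)  # synthetic stand-in salaries
b <- rnorm(10, mean = 70000, sd = 10000)

ranks <- rank(c(a, b))
w_a   <- sum(ranks[seq_along(a)])  # rank sum of the first sample

# wilcox.test() reports U = W - n_A(n_A + 1)/2
u <- wilcox.test(a, b, correct = FALSE)$statistic
all.equal(unname(u), w_a - length(a) * (length(a) + 1) / 2)  # TRUE
```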

6 Conclusions & Limitations

In conclusion, each individual investigation has provided insight into the impact of the advanced and mainstream courses DATA2x02 and their influence on the mental and physical expectations and experiences of students. Initial intuitions aimed at exploring the context of the survey and its immediate influence, COVID; the survey’s COVID tests and vaccination status responses were direct indicators of any association in the results. However, as the COVID test counts showed insufficient evidence of following a Poisson distribution, goals were shifted to inference tests between the two subgroups.

The characteristics of the population of students in the DATA2x02 cohort are implausible to generalise from the survey data, considering the non-response bias resulting from the non-random sampling method. However, attempts at testing the sample data against stated assumptions provided an outlook on the underlying expectations and experiences among students of DATA2x02, with DATA2902 students tending to find courses more difficult and holding higher salary expectations.